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KARL PEARSON'S META-ANALYSIS REVISITED 

By Art B. Owen 1 
Stanford University 

This paper revisits a meta-analysis method proposed by Pearson 
[Biometmka 26 (1934) 425-442] and first used by David [Biometrika 
26 (1934) 1-11]. It was thought to be inadmissible for over fifty years, 
dating back to a paper of Birnbaum [J. Amer. Statist. Assoc. 49 
(1954) 559-574]. It turns out that the method Birnbaum analyzed 
is not the one that Pearson proposed. We show that Pearson's pro- 
posal is admissible. Because it is admissible, it has better power than 
the standard test of Fisher [Statistical Methods for Research Workers 
(1932) Oliver and Boyd] at some alternatives, and worse power at 
others. Pearson's method has the advantage when all or most of the 
nonzero parameters share the same sign. Pearson's test has proved 
useful in a genomic setting, screening for age-related genes. This pa- 
per also presents an FFT-based method for getting hard upper and 
lower bounds on the CDF of a sum of nonnegative random variables. 

1. Introduction. Methods for combining p-values have long been stud- 
ied. Recent research in genomics [Zahn et al. (2007)] and functional mag- 
netic resonance imaging (fMRI) [see references in Benjamini and Heller (2007)] 
has sparked renewed interest. One gets a large matrix of p- values and con- 
siders meta-analysis along the rows, controlling for multiplicity issues in the 
resulting column of combined p-values. 

This paper revisits an old issue in meta-analysis. A genomic application 
lead to the reinvention of a meta-analysis technique of Pearson (1934). That 
method has long been out of favor because the paper by Birnbaum (1954) 
appears to have proved that it is inadmissible (in an exponential family 
context). This paper shows that the method is in fact admissible in that 
exponential family context. Being admissible, it is more powerful than the 
widely used combination method of Fisher (1932) at some points of the 
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alternative hypothesis. Such points turn out to be especially important in 
the motivating biological problem. 

The motivating work, reported in the AGEMAP study of Zahn et al. 
(2007), was to screen 8932 genes, searching for those having expression levels 
that are correlated with age in the mouse. Each gene was tested in m = 16 
tissues, yielding 16 regression coefficients. There was an 8932 x 16 matrix of 
p- values. The data were noisy and it was plausible a priori that most if not all 
genes could have little or no correlation with age. In addition to the recently 
well-studied issue of controlling for multiple testing over genes, there was a 
problem of pooling information from different tissues for any single gene. For 
issues of multiple testing see Dudoit and van der Laan (2008). This article 
focuses on the problem of basing a single test on m > 1 test statistics. 

A gene that is not age-related has a slope parameter of zero in all m 
tissues. For a gene that is age-related, we expect several nonzero slopes. 
Because gene expression is tissue-dependent, it is also possible that a gene 
correlated with age in numerous tissues might fail to correlate in some others. 
Therefore, assuming a common nonzero slope is unreasonable. It is even 
possible that a gene's expression could increase with age in some tissues 
while decreasing in one or more others. But we do expect that the nonzero 
slopes should be predominantly of the same sign. The prior understanding 
of the biology is not detailed enough to let us specify in advance how many 
tissues might be nonage-related or even discordant for an otherwise age- 
related tissue. 

Pearson's method is well suited to this problem. The better-known method 
of Fisher combines p- values by taking their product. It can be based on 
one-tailed or two-tailed test statistics. When based on one-tailed statistics, 
Fisher's test works best if we know the signs of the alternatives. When based 
on two-tailed statistics, Fisher's test does not favor alternatives that share 
a common sign. The test of Pearson (1934) may be simply described. It 
runs a Fisher-style test for common left-sided alternatives as well as one for 
common right-sided alternatives, and it takes whichever of those two is most 
extreme. It builds in a strong preference for common directionality while not 
requiring us to know the common direction, and it remains powerful when a 
small number of tests differ in sign from the dominant one. These properties 
fit the needs in Zahn et al. (2007). 

An outline of this paper is as follows. Section 2 defines the test statistics 
and hypotheses that we work with. Section 3 reviews some basic concepts in 
meta-analysis and compares Pearson's test graphically to the better-known 
competitors. Section 4 shows that Pearson's method is admissible in the 
exponential family context. It also shows that a simple factor of two Bonfer- 
roni correction is very accurate for tail probabilities based on Pearson's test 
statistic. Section 5 reviews the history surrounding the misinterpretation of 
Pearson's proposal. Part of the problem stems from mixing up p- values from 
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one and two-tailed tests. Section 6 compares the power of Pearson's test with 
some others, including tests that are based on the original test statistics, not 
just the p-values. Section 7 considers some of the recent literature on com- 
bining p- values. A discussion is in Section 8. The Appendix presents the 
numerical methods used to make the power computations. These are based 
on a kind of interval arithmetic for sub-distributions using the FFT. 

2. Notation. This section sets out the notation for the paper. First are 
the parameters, hypotheses and test statistics for the univariate settings. 
Then they are combined to define multivariate rejection regions and test 
statistics. 

2.1. Parameters and hypotheses. We consider a setting where there are 
m parameters 0i, . . . , (3 m and m corresponding estimates P\, . . . , f3 m . These 
estimates are random variables whose observed values are denoted by /3f bs ,... 
Pm S - In the motivating problem, these were regression slopes. We assume 
that the m statistics Pj are independent. Dependent test statistics are con- 
sidered briefly in Section 6.3. We also assume that Pj have continuous dis- 
tributions. 

For j = 1, . . . , m, we can consider the hypotheses 

H L>j :P j <0 

and 

Hu,j ■ Pj j= 0, 

based on the sign of Pj. These are the null hypotheses, left- and right-sided 
alternatives and an undirected alternative, respectively. 
Using (3° hs as test statistics, we may define 

p J =Pr0 J <(3f s \(3 j = O) 

and 

pj=Mfc\>\ff x m=o)- 

The p- values for alternatives H^j, Hrj and Huj, respectively, are pj, l — pj 
and pj = 2min(pj,l ~Pj)- A p- value for a one-tailed or two-tailed test is 
called a one-tailed or two-tailed p-value below. 

For the entire parameter vector (3 = (j3\, . . . ,P m ), it is straightforward to 
define the simple null hypotheses Hq for which Pi = @2 = • ■ • = Pm = 0. We 
do not consider composite null hypotheses. 
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For m > 1 , the alternatives to Hq are more complicated than for m = 1 . 
There are 3 m possible values for the vector of signs of (3j values and many 
possible subsets of these could be used to define the alternative. For example, 
one could take 

tf L = (-oo,0] m -{0} or H R = [0,oo) m -{0}, 

or their union. 

But any of these choices leaves out possibilities of interest. Therefore, we 
take the alternative Ha to be that (3j ^ for at least one j 6 {1, . .. , m}. 
That is, H A :/3eM. m -{0}. 

While all of W 71 — {0} is of interest, the parts with concordant signs are 
of greater interest than those with discordant signs. For example, with A > 
0, we want greater power against alternatives ±(A,A,...,A) than against 
other alternatives of the form (±A, ±A, . . . , ±A). In a screening problem, the 
former are more convincing, while the latter cause one to worry about noise 
and systematic experimental biases. The situation is analogous to the choice 
of the Tukey's versus Sheffe's statistic in multiple comparisons: both have 
the alternative of unequal means, but their power versus specific alternatives 
of interest could lead us to prefer one to the other in a given application. 

It is worthwhile to represent the vector (3 as the product t9, where 9 £ M. m 
is a unit vector, and r > 0. We may then consider the power of various tests 
of Hq as r increases. 

2.2. Rejection regions. The decision to accept or reject Hq will be based 
on pi, . . . ,p m . As usual, acceptance really means failure to reject and is not 
interpreted as establishing Hq. The rejection region is R = {(pi, ■ ■ ■ ,p m )\HQ 
rejected} C [0, l] m . For some of the methods we consider, the rejection region 
can be expressed in terms of the two-tailed p- values pj. Then we write R = 
{( Pl ,..., Pm )\H rejected} C [0,l] m . 

While formulating the region in terms of p- values seems unnatural when 
the raw data are available, it does not change the problem much. For ex- 
ample, if flj ~ AA(/?j,a|) with known <r|, then any region defined through 
Pj-values can be translated into one for /?j-values. In that case, we write 

R' = {(Pi, ■ ■ ■ ,P m )\HQ rejected}. It is more realistic for the p- values to come 
from a t distribution based on estimates of crj. For large degrees of freedom, 
the normal approximation to the t distributed problem is a reasonable one, 
and it is simpler to study. Discussion of t distributed test statistics is taken 
up briefly in Section 6.1. 

2.3. Test statistics. Under the null hypothesis pj, 1 — pj and pj all have 
the U (0, 1) distribution. It follows that 

(2.1) Q L = -2log(jlp- ) j, 
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(2.2) QR = - 2 \og (n(l-Pj)) and 



(2.3) Q y = -21og 

\j=i , 

2 



all have the X{2m) distribution under Hq. An a level test based on any of 

2,1— a 
(2m) 



these quantities rejects Hq when that Q is greater than or equal to x%, ' 



Their chi-square distribution is due to Fisher (1932). 

When m = 1, these three tests reduce to the usual one and two-sided tests. 
When m > 1 they are reasonable generalizations of one and two-sided tests. 

The test of Pearson (1934) is based on 

(2.4) Q c = max(Q L ,Q R ). 

If m = 1, then Q c = Q , but for m > 1 they differ. The superscript C is 
mnemonic for concordant. 

The null distribution of Q c is not X%,m) ■ However, a Bonferroni correction 
is quite accurate: 

(2-5) a-^<Pr(Q c > X ^ 2 )<a. 

Equation (2.5) follows from Corollary 1 in Section 4. For instance, when the 
nominal level is a = 0.01, the attained level is between 0.01 and 0.009975. 
Equation (2.5) shows that min(l, 2Pr(x^ 2m ) ^ Q C )) is a conservative p- 
value. Equation (2.5) shows that the accuracy of this Bonferroni inequality 
improves for small a which is where we need it most. 

The statistic Q is the natural one to use when the alternative is known 
to be in H^. But it still has power tending to 1, as r = \\/3\\ tends to infinity, 
so long as 9 = [3/t is not in Hr. Naturally, Q R has a similar problem in Hl 
while being well-suited for Hr. Neither Q c nor Q u have such problems. If 
we have no idea which orthant j3 might be in, then Q u is a natural choice, 
while if we suspect that the signs of large (in absolute value) nonzero (3j are 
mostly the same, then Q c has an advantage over Q u . 

2.4. Stouffer et al. 's meta-analysis. An alternative to Fisher's method 
is that of Stouffer et al. (1949), which is based on turning the p- values into 
Z-scores. Let <p(x) = exp(— x 2 /2) / ^/2tt denote the A/"(0, 1) density, and then 
let ${x)=5* 00 y(z)dz. 

We can define tests of Hq based on Z-scores via 
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(2.7) ^ = 7 =5>-i fe) , 

(2.8) ^ = ^El^" 1 fe)l and 

(2.9) S c = max(5 L ,,S fl ), 

which are directly analogous to the tests of Section 2.3. For independent 
tests considered here, S L and S R have the A/"(0, 1) distribution under Hq, 
while S c has a half-normal distribution and S u does not have a simple 
distribution. Note that S L = -S R and that S c = \S L \ = \S R \. 

3. Meta-analysis and a graphical comparison of the tests. This section 
reviews some basics of meta-analysis for further use. Then it presents a 
graphical comparison of Pearson's test with the usual tests, to show how it 
favors alternatives with concordant signs. For background on meta-analysis, 
see Hedges and Olkin (1985). 

It has been known since Birnbaum (1954) that there is no single best 
combination of m independent p- values. A very natural requirement for a 
combination test is Birnbaum's. 

Condition 1. If Hq is rejected for any given (pi, . . . ,p m ), then it will 
also be rejected for all (p*, . . . ,Pm) such that p*j < pj for j = 1, . . . , m. 

Birnbaum proved that every combination procedure which satisfies Con- 
dition 1 is in fact optimal, for some monotone alternative distribution. Op- 
timality means maximizing the probability of rejecting Hq, subject to a con- 
straint on the volume of the region R of vectors (p\, . . . ,p m ), for which Hq 
is rejected. Birnbaum allows simple alternatives that have independent pj 
with decreasing densities gj(pj) on < pj < 1. He also allows Bayes mixtures 
of such simple alternatives. Birnbaum shows that Condition 1 is necessary 
and sufficient for admissibility of the combination test, again in the context 
of decreasing densities. Here is his definition of admissibility: 

Definition 1 [Birnbaum (1954), page 564]. A test is admissible if there 
is no other test with the same significance level, which, without ever being 
less sensitive to possible alternative hypotheses, is more sensitive to at least 
one alternative. 

The top row of Figure 1 illustrates 4 rejection regions R satisfying Con- 
dition 1, arising from the Fisher test, the Stouffer test, a test based on 
min(pi, . . . ,p m ) and a test based on max(pi, . . . ,p m ), for the case m = 2. 
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Fig. 1. This figure shows rejection regions for a pair of tests. The top four images have 
coordinates {pi,p2) where pj near zero is evidence against Hoj in a two-tailed test. The 
columns, from left to right, are based on min(pi,p2), max(pi,p2), Fisher's combination and 
Stouffer's combination, as described in the text. Each region has area 1/10. The bottom 
row shows the same rejection regions in coordinates (pi,p2), where Pj near is evidence 
that f3j < 0, and pj near 1 is evidence that f3j > 0. 



Using the minimum is due to Tippett (1931), while the maximum is from 
Wilkinson (1951), who is credited with the more general approach of using 
an order statistic of the pj . 

The criterion min(pi,p2) leads us to reject Ho if either test 1 or test 2 
is strongly significant. The criterion max(pi,p2) is similarly seen to require 
at least some significance from both test 1 and test 2. Birnbaum's result 
opens up the possibility for many combinations between these simple types, 
of which Fisher's test and Stouffer's test are two prominent examples. 

Graphically, we see that Fisher's combination is more sensitive to the sin- 
gle smallest p- value than Stouffer's combination is. In the Fisher test, if the 
first m — 1 p- values already yield a test statistic exceeding the X?2m) signif- 
icance threshold, then the mth test statistic cannot undo it. The Stouffer 
test is different. Any large but finite value of J^jl^i < ^ _1 (Pj) can be canceled 
by an opposing value of $ (p m ). 

The bottom row of Figure 1 illustrates the pre-images (pi,P2), where 
Pj = 2min(pj, 1—pj) of the regions in the top row. Those images show which 
combinations of one sided p- values lead to rejection of Hq. Each region R 
in the bottom row of Figure 1 is symmetric with respect to replacing pj by 
1—pj, and so they do not favor alternatives with concordant signs. 

Figure 2 shows the rejection regions for concordant versions of the Fisher 
and Stouffer tests. Both of these regions devote more area to catching coor- 
dinated alternatives to Hq than to split decisions. Comparing the upper left 
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Fig. 2. This figure shows rejection regions R for concordant tests, Q c and S° , as de- 
scribed in the text. The left image shows a region for Pearson's Q c which is based on 
Fisher's combination. The right image shows a region for S° , based on Stouffer's com- 
bination. The x and y axes in these images correspond to one sided p-values pi and p2, 
rejecting Ho for negative slopes at the bottom and/or left, while rejecting H for positive 
slopes at the top and/or right. These tests are more sensitive to alternatives where all 
underlying hypothesis tests reject in the same direction than they are to split decisions. 
The Stouffer region extends to all corners but with a thickness that approaches zero. The 
Fisher region has strictly positive thickness in each corner. 



and lower right corners of these regions we see that the Stouffer version is 
more extreme than the Fisher test in rejecting split decisions. 

A naive reading of Condition 1 is that almost any p- value combination is 
reasonable. But some of those combinations are optimal for very unrealistic 
alternatives. Birnbaum (1954) goes deeper by considering alternatives in an 
exponential family, beginning with his second condition. 

Condition 2. If test statistic values 0i, . . . ,/3 m ) and (/?*,... do 
not lead to rejection of Hq, then neither does \({3i, ■ ■ ■ ,{3 m ) + (1 — ty(Pti • • • j Pm) 
for 0< A< 1. 

Condition 2 requires that the acceptance region, in test statistic space, 
be convex. If the test statistics being combined are from a one parameter 
exponential family, then Birnbaum (1954) shows that Condition 2 is nec- 
essary for the combined test to be admissible. When the parameter space 
is all of R m , then Condition 2 is also sufficient for the combined test to be 
admissible. This is a consequence of the theorem in Stein (1956), Section 
3. Birnbaum (1955) had this converse too, but without a condition on un- 
boundedness of the parameter space. Matthes and Truax (1967) prove that 
the unboundedness is needed. Thus Condition 2 is reasonable and it rules 
out conjunction-based tests like the one based on max(j>i, . . . ,p m ), and more 
generally, all of the Wilkinson methods based on p^ for 1 < r < m. 

Suppose that j3j ~ AA(/3j,l). Then Birnbaum (1954) shows that a test 
which rejects Hq when FJj=i(l ~~ Pj) 1S t°° large, fails Condition 2. For 
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Fig. 3. This figure shows nested decision boundaries in the space of test statistics 
P = (/3i, . . . ,P m ) for meta-analysis methods described in the text. We are interested in 
the regions where Ho is not rejected. From left to right they are: Fisher's combination 
Q u with lozenge- shaped regions, Pearson's combination Q c with leaf-shaped regions, a 
left-sided combination Q L with quarter-round- shaped regions north-east of the origin and 
Birnbaum's version of Pearson's region, having nonconvex plus-sign-shaped regions. In 
each plot the significance levels are at 0.2, 0.1, 0.05 and 0.01. 

small a and m = 2, that test has a nearly triangular rejection region R 
including (0,0) in the (p\,P2) plane. For small a it would not reject at 
(,Pi>P2) = (0,1) or even at (0,0.5). Birnbaum (1955) attributes that test 
to Karl Pearson through a description given by Pearson (1938). But Karl 
Pearson did not propose this test and it certainly is not Q c . It appears that 
Birnbaum has misread Egon Pearson's description of Karl Pearson's test. A 
detailed discussion of that literature is given in Section 5. 

Theorem 1 in Section 4 shows that Q c satisfies Condition 2 for normally 
distributed test statistics, and so it is admissible. Figure 3 illustrates some 
acceptance regions for Q u , Q , Q L and rTj=i(l ~Pj)- 

In applications, Fisher's method is more widely used than Stouffer's. In 
a limit where sample sizes increase and test statistics become more nearly 
normally distributed, some methods are maximally efficient in Bahadur's 
sense [Bahadur (1967)]. Fisher's method is one of these, and Stouffer's is 
not. See Table 3 on page 44 of Hedges and Olkin (1985). Both Birnbaum 
(1954) and Hedges and Olkin (1985) consider Fisher's method to be a kind 
of default, first among equals or better. 

4. Pearson's combination. In this section, we prove two properties of 
Pearson's combination Q c . The acceptance regions in the second panel of 
Figure 3 certainly appear to be convex. We prove that his test satisfies 
Condition 2 (convexity), for Gaussian test statistics. Therefore, it is in fact 
admissible in the exponential family context for the Gaussian case. The 
result extends to statistics with log-concave densities. Finally, we show that 
the Bonferroni bound on the combination is very accurate for small combined 
p- values. 

4.1. Propagation of admissibility to Q c . We consider first the setting of 
Gaussian test statistics j3j ~ N(Pj,a 2 /nj). For simplicity, we work in terms 
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of tj = y/njPj/a. Under Hq, the tj are i.i.d. jV(0, 1). The proof that Q 
is admissible uses rules that propagate convexity and quasi-convexity. It is 
broken into small pieces that are used for other test statistics in Sections 6 
and 7. 

Let Q(ti, . . . ,t m ) be a real- valued function on W 71 . The associated test 
rejects Hq in favor of Ha when Q > q. That test is admissible in the ex- 
ponential family context if {(ii, . . . ,t m )\Q < q} is convex and the natural 
parameter space is M. m . 

Lemma 1. For k = 1, . . . ,n, suppose that the test which rejects Hq versus 
Ha when Qk(t%, . .. ,t m ) > qk is admissible in the exponential family context 
with natural parameter space M m . Let the test based on Qk have level at- 
Then the test which rejects Hq when one or more of Qk(ti, ■ ■ ■ ,t m ) > qk hold 
is also admissible and has level a < J2t=i a k- 

Proof. Under the assumptions, the acceptance regions for the n indi- 
vidual tests are convex. The combined test has an acceptance region equal 
to their intersection which is also convex. Therefore, the combined test is 
admissible. The level follows from the Bonferroni inequality. □ 

Lemma 2. For k = l,...,n, suppose that a test rejects Hq versus Ha 
when Qk(ti, . . . , t m ) >qt, where Qk is a convex function on M m . Then the 
test which rejects Hq when Q(t\, . . . , t m ) = J2k=i Qk(ti> ■ ■ ■ > Un) — Q holds is 
admissible in the exponential family context with natural parameter space 
M. m . 



Proof. Under the assumptions, each Qk is a convex function, and there- 
fore, so is their sum Q. Then {(t\, . . . ,t m )\Q < q} is convex and so the test 
is admissible. □ 

Given a battery of admissible tests, Lemma 1 shows that we can run 
them all and get a new admissible test that rejects Hq if any one of them 
rejects. Lemma 2 shows that for tests that have convex criterion functions, 
we can sum those functions and get a test criterion for another admissible 
test. Lemma 2 requires a stronger condition on the component tests Qk 
than Lemma 1 does. Aside from the difference between open and closed 
level sets, the criterion used in Lemma 1 is quasi-convexity. Sums of quasi- 
convex functions are not necessarily quasi-convex [Greenberg and Pierskalla 
(1971)]. 

Theorem 1 . For t\ , . . . , t m G K m let 

(m m \ 

-2i og n -2 log n $ (-*i) ■ 
3=± 3=1 / 
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Then {(t\, . . . ,t m )\Q < q} is convex so that Pearson's test is admissible in 
the exponential family context, for Gaussian data. 

Proof. The function &(t) is log concave. Therefore, the test that rejects 
Hq when = — 2 log <£(£&) is large (ignoring the other m — 1 statistics) has 
a convex criterion function. Applying Lemma 2 to those tests shows that 
the test based on Q L = — 2J2JLi l°g(^K*i)) nas convex acceptance regions. 
The same argument starting with log concavity of 3> (— t) shows that the test 
based on Q R = -2E^ilog(*(-tj)) has convex acceptance regions. Then 
Lemma 1 applied to Q L and Q R shows that Q c has convex acceptance 
regions. □ 

The Gaussian case is concrete and is directly related to the motivating 
context. But the result in Theorem 1 holds more generally. Suppose that the 
density functions (d/dt)Fj (t) exist and are log concave for j = 1, . . . , m. Then 
both Fj(t) and 1 — Fj(t) are log concave [Boyd and Vandeberghe (2004), 
Chapter 3.5]. Then the argument from Theorem 1 shows that the test criteria 



Q L = "2 Ej log(Fj (tj )),Q R = -2J2 j log(l - Fj (tj ) ) and Q c = max(Q i , Q R ) 



all have convex acceptance regions. 

While many tests built using Lemmas 1 and 2 are admissible in the ex- 
ponential family context, it may still be a challenge to find their level. The 
next section takes up the issue of bounding a for tests like Q ' . 

4.2. Bonferroni. Here we show that the Bonferroni bound for Q c is very 
accurate, in the tails where we need it most. The Bonferroni bound for Q c 
is so good because it is rare for both Q L and Q R to exceed a high level under 
Hq . For notational simplicity, this section uses Xj in place of pj . 

Theorem 2. Let X = (Xi, . . . ,X m ) e (0, l) m be a random vector with 
independent components, and put 



and Q c = max(Q L ,Q R ). For AeR, let P L = Pv(Q L > A), P R = Pr(Q R > 
A) and P T = Pr{Q c > A) . Then P l + Pr- P L Pr < Pt < Pl + Pr- 

Corollary 1. Suppose that X ~ £7(0, l) m in Theorem 2. For A>0 let 
T A = Pi{ X 2 {2m) > A). Then 

It A ~r\< Pr(Q c > A) < 2t a . 



Q 



,L — 




Q 



t R = 
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PROOF. When X ~ U(0,l) m , then Pr(Q L > A) = Pr(Q R > A) = ta- 
The result follows by substitution in Theorem 2. □ 

Before proving Theorem 2, we present the concept of associated random 
variables, due to Esary, Proschan and Walkup (1967). 

Definition 2. A function / on R n is nondecreasing if it is nondecreasing 
in each of its n arguments when the other n — 1 values are held fixed. 

Definition 3. Let X±, . . . ,X n be random variables with a joint distri- 
bution. These random variables are associated if Cov(/ (X) , g(X)) > holds 
for all nondecreasing functions / and g for which the covariance is defined. 

Lemma 3. Independent random variables are associated. 

PROOF. See Section 2 of Esary, Proschan and Walkup (1967). □ 

Lemma 4. For integer m>l, let X = (Xi,. . . ,X m ) £ (0, l) m have in- 
dependent components. Set 

m m 

Q L = -21ogn^i and Q R = -2 log - Xj). 

Then for any Al > and Ar > 0, 

Pr(Q L > A L , Q R > A R ) < Pr(Q L > A L ) Yt{Q r > A R ). 

Proof. Let h(X) = 21o g n j m =i^i and H x ) = -21o g n^i(l - Xj). 
Then both fi and f2 are nondecreasing functions of X . The components of 
X are independent, and hence are associated. Therefore, 

Pt(Q l >A l ,Q r >A r ) 

= PT(-f 1 (X)>A L ,f 2 (X)>A R ) 

= Pv(f 1 (X)<-A L ,f 2 (X)>A R ) 

= Pv(f 2 (X) > A R ) - Pv(h(X) > -A L ,f 2 (X) > A R ) 

< Pr(/ 2 (X) > A R ) - Pr(/i(X) > -A L ) Pr(/ 2 (X) > A R ) 

= Pr(/ 2 (X) > A R ) Pr(/!(X) < -A L ) 

= Pt{Q r >A r )Pt{Q l >A l ). □ 

In general, nondecreasing functions of associated random variables are as- 
sociated. Lemma 4 is a special case of this fact, for certain indicator functions 
of associated variables. 
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Proof of Theorem 2. The Bonferroni inequality yields Pt < Pl + Pr- 
Finally, P T = P L + P R - Pr(Q L > A,Q R > A) and Pr(Q L > A, Q R > A) < 
PlPr from Lemma 4. □ 

Remark 1 . The same proof holds for combinations of many other tests 
besides Fisher's. We just need the probability of simultaneous rejection to 
be smaller than it would be for independent tests. 

5. History of Pearson's test. Pearson (1933) proposed the product 
nE=i Fo(Xi) as a way of testing whether i.i.d. observations X±, . . . ,X n are 
from the distribution with CDF Fq. He finds the distribution of the product 
in terms of incomplete gamma functions and computes several examples. 
Pearson remarks that the test has advantages over the x 2 test of goodness 
of fit: small groups of observations do not have to be pooled together, and 
one need not approximate small binomial counts by a normal distribution. 
Pearson also notices that the approach is applicable more generally than 
testing goodness of fit for i.i.d. data, and in a note at the end, acknowledges 
that Fisher (1932) found the distribution earlier. 

In his second paper on the topic, Pearson (1934) found a p- value for 
17, F(Xj) and one for Ylj(l — F(Xj)) and then advocated taking the smaller 
of these two as the "more stringent" test. Modern statisticians would instinc- 
tively double the smaller p- value, thereby applying a Bonferroni factor of 2, 
but Pearson did not do this. 

Birnbaum [(1954), page 562] describes a test of Karl Pearson as follows: 

"Karl Pearson's method: reject Hq if and only if (1 — m)(l — M2) ■■ ■ (1 — Ufc) > c, 
where c is a predetermined constant corresponding to the desired significance level. In 
applications, c can be computed by a direct adaptation of the method used to calculate 
the c used in Fisher's method." 

In the notation of this paper, (1 — ui)(l — U2) • • • (1 — Uk) is rij=i(l ~Pj)> f° r 
Figure 4 of Birnbaum (1954), and it is ]lj=i(l ~Pj) for Figure 9. The latter 
(but not the former) would lead to an admissible test, had the rejection 
region been for small values of the product. 

Birnbaum does not cite any of Karl Pearson's papers directly, but does 
cite Egon Pearson (1938) as a reference for Karl Pearson's test. Pearson 
[(1938), page 136] says, 

"Following what may be described as the intuitional line of approach, Pearson (1933) 
suggested as suitable test criterion one or other of the products 

Ql = 2/12/2 • • • 2/n 

or 



Qi = (l-2/i)(l-2/2)---(l-2/»)-" 
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The quotation above omits two equation numbers and two footnotes but is 
otherwise verbatim. In the notation of this paper, Q\ = YYJLi Pi an d Qi = 
rij=i(l —Pj)- E. Pearson cites K. Pearson's 1933 paper, although it appears 
that he should have cited the 1934 paper instead, because the former has 
only Qi, while the latter has Q\ and 

Birnbaum (1954) appears to have read E. Pearson's article as supplying 
two different proposals of K. Pearson, and then chose the one based on Q^, 
rejecting for large values of that product. 

In an article published after Pearson (1933) but before Pearson (1934), 
David (1934), page 2, revisits the 12 numerical examples computed by 
Pearson (1933) and reports that in 4 of those, Pearson made a wrong guess 
as to which of Qi and Q[ would be smaller: 

"Thus in 8 of the 12 illustrations the more stringent direction of the probability 
integrals was selected by mere inspection. In the other 4 cases B ought to have been 
taken instead of A, but in none of these four cases was the difference such as to upset 
the judgment as to randomness deduced from A." 

Pearson (1933) computes Q\ all 12 times and does not mention that this is 
a guess as to which product is smaller. Thus it is David's paper in which 
min((5i,Q / 1 ) is first used (as opposed to Q\). One might therefore credit 
David with this test, as for example, Oosterhoff (1969) does. But David 
credits Pearson for the method. 

Birnbaum's conclusion about Pearson's test is now well established in the 
literature. Hedges and Olkin [(1985), page 38] write, 

"Several other functions for combining p-values have been proposed. In 1933 Karl 
Pearson suggested combining p-values via the product 

(l-pi)(l-pa)---(l-p fc )- 

Other functions of the statistics p* = Min{pi, 1 — pi}, i = 1, . . . , k, were suggested by 
David (1934) for the combination of two-sided test statistic, which treat large and small 
values of the pi symmetrically. Neither of these procedures has a convex acceptance 
region, so these procedures are not admissible for combining test statistics from the 
one-parameter exponential family." 

The pi in this quotation might refer to either pi or pi in the notation of this 
paper. If the former, then the product would be equivalent to Q R and would 
be admissible. If the latter, then the product would conform to the statistic 
that Birnbaum studies, but not the one Karl Pearson proposed. Furthermore, 
in that case the quantity p* = min(pj, 1 — Pi) vanishes at pi G {0, 1/2, 1} and 
takes its maximal value at pi G {1/4,3/4}, which does not make sense. 

6. Power comparisons. Admissibility of Q c is historically interesting, 
but for practical purposes we want to know how its power compares to 
other test statistics such as undirected ones, Stouffer based ones and some 
likelihood ratio tests developed below. 
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In this section, we compare the power of these tests at some alternatives of 
the form (A, . . . , A, 0, . . . , 0) for A > where k components are nonzero and 
m — k are zero. Alternatives of the form (—A, A, . . . , A, 0, . . . , 0) for A > 
are also investigated. 

Not surprisingly, concordant tests generally outperform their undirected 
counterparts. The most powerful method depends on the value of k. In ap- 
plications where we know roughly how many nonzero and discordant slopes 
to expect, we can then identify which method will be most powerful, using 
the methods in this section. 

The power of tests based on S , S and S c can be handled via the nor- 
mal CDF. The statistics Q L , Q R , Q u and S u are all sums of m independent 
nonnegative random variables. It is therefore possible to get a good approx- 
imation to their exact distributions via the fast Fourier transform (FFT). 
For each of them, we use the FFT first to find their critical level (exceeded 
with small probability a). The FFT is used in such a way as to get hard 
upper and lower bounds on the cumulative probabilities which yield hard 
upper and lower bounds for the critical value. Then a second FFT under 
the alternative hypothesis is used to compute their power. The upper limit 
of the power of some Q comes from the upper bound on the probability of 
Q exceeding the lower bound on its critical value. The lower limit of power 
is defined similarly. 

For Q c ', the upper and lower limits on the critical value come via (2.5) 
applied to the bounds for Q L and Q R . So do the upper and lower limits for 
the power. 

All the computations were also done by simple Monte Carlo, with 99.9% 
confidence intervals. Usually the 100% intervals from the FFT were narrower 
than the MC intervals. But in some cases where (2.5) is not very tight, such 
as concordant tests at modest power, the MC intervals came out narrower. 

6.1. Alternatives to meta-analysis. In the AGEMAP study all of the 
original data were present and so one could use them to form the usual test 
statistics instead of summing logs of p- values. To focus on essentials, suppose 
that we observe Zj ~ M{j3j, 1) for j = 1, . . . , m. 

Then some very natural ways to pool the data are via Z R = J^JLi^j, 
Z L = -Z R and Z u = J2f=i Z]- Of course Z L = ^h~S L and Z R = ^fks R 
but Z u is not equivalent to S u . We would use Z R if the alternative were 
01 = @2 = ■ ■ ■ = Pm > 0, or even if, within the alternative Ha, the positive 
diagonal were especially important. The test criterion Z u is a more prin- 
cipled alternative to Fisher's Q u = — 2Y^"^ 1 log(2$(— |Zj|)) which does not 
account for the Gaussian distribution of the test statistics. Like Q u it does 
not favor concordant alternatives. 

Marden (1985) presents the likelihood ratio test statistic (LRTS) for Hr 
versus H . It takes the form A R = jyjLi max(0, Zj) 2 . The LRTS for H L 
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Fig. 4. The left panel is the rejection region for A formed by combining two Gaussian 
likelihood ratio tests of Marden (1985). The right panel is rejection region for Pearson's 
test Q c . Both regions are with respect to one-tailed p -values (pi,p2) £ (0, l) 2 . 

versus H Q is A L = YJjLx max(0, -Zj) 2 . Here we use A c = max(A i , A R ), a 
concordant version of Marden's LRTS. 

Proposition 1. Let X = (X 1: . . . ,X m ) ~ U(0, l) m , Z j = ^~ 1 (X j ), and 
put A L = Y,f =1 max(0, -Zj) 2 , A R = max(0, Zj) 2 and A c = max(A L , 
A R ). Then {X\A C < A} is convex. ForA>0 let t a = Pr(xJ 2m) > A). Then 
2t a ~t\< Pr(A c > A) < 2t A - 

Proof. Convexity of the acceptance region for A c follows as in The- 
orem 1 by starting with convexity of max(0, $ -1 (Xj)) 2 and using Lem- 
mas 1 and 2. For the second claim, the same argument used to prove 
Theorem 2 applies here, starting with nondecreasing functions f\(X) = 
-E^imax^,^- 1 ^)) 2 and f 2 (X) = £™ =1 max(0, -<Z>~ 1 (X j)) 2 . □ 

Figure 4 compares the rejection regions for A c and Q c . Here we see that 
Q has more of its rejection region devoted to coordinated alternatives than 
does A c . Recalling Figure 2, we note that S has even more of its region 
devoted to coordinated alternatives than Q , and so Q c is in this sense 
intermediate between these two tests. 

Marden (1985) also presents an LRTS based on t-statistics. As the degrees 
of freedom increase the t-statistics rapidly approach the normal case consid- 
ered here. They are not, however, an exponential family at finite degrees of 
freedom. If experiment j gives a i-statistic of Tj on rij degrees of freedom, 
then the one-sided likelihood ratio test rejects Hq for large values of 

m 

(6.1) T R = J2( n J + l)log(l + max(T i ,0) 2 /^). 

3=1 
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Fig. 5. This figure shows power ranging from near to 1 for a test of Ho at level a = 0.01. 
The alternative hypothesis is Ha ■ The true parameter values [3 have k components A > 
and m — k components 0. Here m = 16 and the values of k are printed above their power 
curves. The black lines are for the usual \ 2 test statistic Z u — y\ Z| . The red lines are 
for Q c and the blue lines are for A c . For each curve, upper and lower bounds are plotted, 
but they almost overlap. 

This is (18) of Marden (1985), after correcting a small typographical error. 
For large n the summands in (6.1) expressed as functions of pj ~ U(0, 1) are 
very similar to max(0, Zj) 2 . 

6.2. Numerical examples. We have Hq : (3 = (0, . . . , 0) and Ha is (3 ^ 0. 
The tests will be made at level a = 0.01. This value is a compromise between 
the value 0.001 used in Zahn et al. (2007) and the more conventional value 
0.05. The ranking among tests is not very sensitive to a. 

Let m = 16 and suppose that = (A, . . . , A, 0, . . . , 0) G H A C R m for A > 
0. The estimates (3 are distributed as N{(3, I m ). The number of nonzero com- 
ponents is k S {2,4,8, 16}. As A increases the power of the tests increases. 
The results are shown in Figure 5. 

If k > 8, then Q c (red curves) has better power than A*^ (blue curves), 
while for small k, A c does better. The black curves are for the test statistic 
Z u . For k > 4, the concordant methods dominate Z u . In this example, S c 
has the best power for k = 16. Power for S c is not shown but is included on 
later plots. Tests based on Q c do poorly in the case with k = 2. 

Continuing this example, we now make A depend on k, so that we can get 
all values of k onto one plot. Specifically A = A& is chosen so that the usual 
test based on Z u has power exactly 0.8. Then the best method for small k 
arises from A^, the best for the largest k comes from S , while Q c is best 
in the middle range and is nearly best for large k. The central Stouffer test 
based on S u has power less than 0.8 over the whole range. The results are 
shown in Figure 6. 
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Finally, we consider the setting where k — 1 of the f3j equal to A^ > 0, 
while one of them is — A&. Again A& is chosen so that a test based on Z u 
has power 0.8. Figure 7 shows the results. For small k A c works best, while 
for larger k, Q c works best. The Stouffer test S c is best when k = 16, but 
loses power quickly as k decreases. 

An online supplement at http://stat.stanford.edu/~owen/reports/ 
KPearsonSupplement contains 32 plots like the figures shown here. These 
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Fig. 6. This figure shows the power of various tests of Ho:/3 = when (3 = (A, 
. . . , A, 0, . . . , 0) G R m . The number k of nonzero components ranges from 1 to m = 16 
on the horizontal axis. For each k, A was chosen so that a test based on Z u = y^ m . Zf 

has power 0.8 of rejecting Ho. Results are given for Q c (red), A c (blue), S c (black) and 
S u (green). 
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Fig. 7. This figure shows the power of various tests of Ho : /3 = when (3 = (—A, A, . . . , 
A, 0, . . . , 0) € R m . The number k of nonzero components ranges from 1 to m = 16 on the 
horizontal axis. There is always one negative component. For each k, A was chosen so 
that a test based on Z u = X/J-i has power 0.8 of rejecting Ho- Results are given for 
Q c (red), A c (blue), S c (black) and S u (green). In this figure Monte Carlo was used for 
Q c and A c . 
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figures are also given in the technical report Owen (2009). The cases con- 
sidered have me {4,8,12,16}, a 6 {0.01, 0.05} and power in {0.8,0.5}. The 
number k of nonzero components ranges from 1 to m. In one set of plots 
there are k positive components, while the other has k — 1 positive and 1 
negative component. 

Those other figures show the same patterns as the one highlighted here. 
Among the concordant tests, is best for small k, then Q c is best for 
medium sized k and finally S c is best for the largest k. When there is one 
negative component, is most adversely affected, and A c least. That is, 
the tests that gain the most from coordinated alternatives, lose the most 
from a discordant component. The case k = 2 is hardest when there is one 
negative component, for then (3 contains two nonzero components of opposite 
sign. The Stouffer test S u is never competitive with Z u , though the gap is 
small when k = m. 

When 2 < k < m = 4 and one component is negative, then (3 does not 
really have concordant signs, and in these cases Q c , A c and S c have less 
power than S u which has less power than Z u . 

6.3. The original application. The AGEMAP study of Zahn et al. (2007) 
used Q c to filter genes. That setting featured 8932 genes to be tested in 
m = 16 tissues with data from n = 40 animals (slightly fewer when some 
data were missing). There were 5 male and 5 female animals at each of 4 
ages. For gene i and tissue j there was a regression of 40 gene expression 
values on age, sex and an intercept. The p-value for gene i and tissue j was 
based on a t-test for the age coefficient, usually with 37 degrees of freedom. 

Of the 16 tissues investigated, only 9 appeared to show any aging, and so 
much of the analysis was done for just those 9. 

The biological context made it impossible to specify a single alternative 
hypothesis, such as Hl, Hr or Hl U Hr to use for all genes. Instead it 
was necessary to screen for interesting genes without complete knowledge 
of which tissues would behave similarly. Also, the fates of tissues differ in 
ways that were not all known beforehand. One relevant difference is that 
the thymus involutes (becomes fatty) while some other tissues become more 
fibrous. There are likely to be other differences yet to be discovered that 
could make one tissue age differently from others. 

The 8932 genes on any given microarray in the study were correlated. 
There were also (small) correlations between genes measured in two dif- 
ferent tissues for the same animal. It is, of course, awkward to model the 
correlations among an 8932 x 16 matrix of observations based on a sample 
of only 40 such matrices. An extensive simulation was conducted in Owen 
(2007). In that simulation, genes were given known slopes and then errors 
were constructed by resampling residual matrices for the 40 animals. By 
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resampling residuals, some correlations among genes and tissues were re- 
tained. In each Monte Carlo sample, the genes were ranked by all the tests, 
and ROC curves described which test method was most accurate at ranking 
the truly age-related genes ahead of the others. 

In that Monte Carlo simulation, there were k G {3, 5, 7} nonzero slopes 
out of m G {9, 16}. Various values of A were used. The Fisher-based tests 
consistently outperformed Stouffer-style tests, as statistical theory would 
predict. The concordant tests usually outperformed central tests, and were 
almost as good as the one-sided tests that we would use if we knew the 
common sign of the nonzero /3j. The exceptions were for cases with k = 3 
and m = 16 and large A. Then the Q u tests could beat the Q c tests. For 
k = 3 and m = 16 and small A, the undirected and concordant tests were 
close. The left-hand side of Figure 6 confirms that pattern. 

6.4. Power conclusions. The methods of this section can be used to com- 
pare different combinations of tests. Given precise information, such as a 
prior distribution ir(/3) for nonnull (5, one could employ a weighted sum of 
power calculations to estimate J Rm n(fi) Pr(Q > Q 1 ~ a \P) df3 for each test Q 
in a set of candidates. 

Given less precise qualitative information, that the alternatives are likely 
to be concordant, we can still make a rough ordering of the methods when 
we have some idea how many concordant nonnull hypotheses there may be. 
Of the concordant methods compared above, the LRT A c is best where 
there are a few concordant nonnulls, then Pearson's Q is best if we expect 
mostly concordant nonnulls, and finally Stouffer's method S came out best 
only when the concordant nonnulls were unanimous or nearly so. 

7. Recent literature. Although combination of p-values is a very old 
subject, there seems to be a revival of interest. Here, a few works related to 
the present setup are outlined. 

Whitlock (2005) takes the strong view that any discordant test means that 
the null hypothesis should not be rejected. He gives the example of inbred 
animals being significantly larger than outbred in one study but significantly 
smaller in another study, and says that the results should then cancel each 
other out. In some contexts, such as AGEMAP, such cancellation would not 
make sense. Whitlock (2005) has a strong preference for the Stouffer test 
over Fisher style of tests. He reports better performance for Stouffer-style 
tests in a simulation, which is not what statistical theory would predict. His 
computations, however, are not equivalent to the Fisher statistics reported 
here. For example, he reports that, 

"A result of P = 0.99 is as suggestive of a true effect as is a result of P — 0.01. Yet 
with Fisher's test, if we get two studies with P — 0.01, the combined P is 0.001, but 
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with two studies with P = 0.99 the combined P is 0.9998. One minus 0.9998 is 0.0002. 
The high P and low P results differ by an order of magnitude, yet the answer should 
be the same in both cases." 

Interpreting the P-values as one-tailed pVvalues, Fisher's test Q u uses 
Pj = 0.02 in both cases for a combined p- value of Pr(X( 4 ) > — 21og(0.02 2 )) = 
0.0035. It is reasonable to conclude that the poor performance of Fisher's 
test reported in Whitlock (2005) does not apply to Q u . 

Benjamini and Heller (2007) consider "partial conjunction" alternative 
hypotheses, 

(7.1) H r : Pj / for at least r values of j € {1, ... , m}. 

In that setting, one decides a priori that there must be at least r false nulls 
among the m component hypotheses for the finding to be useful. 

They present several tests based on combinations of all but the r — 1 
most significant of the tests. Their combination methods include Fisher's 
and Stouffer's as well as one due to Simes (1986). The partial conjunction 
null described above can be applied using pj, 1 — pj, or pj to get left, right 
and undirected versions. A concordant version could also be useful if it were 
only of interest to find hypotheses rejected in the same direction by at least 
r tests. To get a concordant version, one takes twice the smaller of the left- 
and right-sided combined p-values. 

For r > 1, nontrivial acceptance regions based on combinations of only the 
least significant m — r + 1 p-values are not convex because they include any 
point on any of the m coordinate axes in M. d . As a result, the tests are not 
admissible versus the point hypothesis Hq in the exponential family context. 
The alternatives H r in (7.1) for r > 1 are not simple null hypotheses though, 
and so the tests may be admissible for their intended use. 

8. Discussion. This paper has shown that Pearson's method Q c really is 
admissible, and perhaps surprisingly, is competitive with standard methods 
based on the raw data, not just the p-values. The context where it is compet- 
itive is one where the truly nonzero components of the parameter vector are 
predominantly of one sign. We have also studied a concordant LRT test A 
which performs well when the number of concordant alternatives is slightly 
less. 

Also a very simple Bonferroni calculation proved to be very accurate for 
finding critical values of tests. It is less accurate for computing modest power 
levels. 

In a screening setting like AGEMAP, we anticipate that noise artifacts 
could give rise to values Pj with arbitrary patterns of signs, while the true 
biology is likely to be dominated by concordant signs. In early stages of the 
investigation, false discoveries are considered more costly than false nondis- 
coveries because the former lead to lost effort. Later when the aging process 
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is better understood, there may be greater value in finding those genes that 
are strongly discordant. In that case, combination statistics which favor dis- 
cordant alternatives may be preferred. 

Finally, this work has uncovered numerous errors in earlier papers. I do 
not mean to leave the impression that the earlier workers were not careful, 
either in an absolute or relative sense. The subject matter is very slippery. 

APPENDIX: COMPUTATION 

We want to get the 1 — a quantile of the distribution of Q = Y^JLi Yj 
where Yj are independent but not necessarily identically distributed random 
variables on [0, oo). The case of random variables with a different lower 
bound, possibly — oo, is considered in a remark below. We suppose that 
Yj has cumulative distribution function Fj{y) = Pi(Yj < y) which we can 
compute for any value y £ [0, oo). 

A.l. Convolutions and stochastic bounds. Because the Yj are indepen- 
dent, we may use convolutions to get the distribution of Q. Convolutions may 
be computed rapidly using the fast Fourier transform. A very fast and scal- 
able FFT is described in Frigo and Johnson (2005) who make their source 
code available. Their FFT on N points is tuned for highly composite values 
of N (not just powers of 2) while costing at most 0(N\og(N)) time even 
for prime numbers N. Thus one does not need to pad the input sequences 
with zeros. 

There are several ways to apply convolutions to this problem. For a dis- 
cussion of statistical applications of the FFT, including convolutions of dis- 
tributions, see the monograph by Monahan (2001). The best-known method 
convolves the characteristic functions of the Yj to get that of Q and then in- 
verts that convolution. But that method brings aliasing problems. We prefer 
to convolve probability mass functions. This replaces the aliasing problem 
by an overflow problem that is easier to account for. 

We write F <g> G for the convolution of distribution functions F and G. 
Our goal is to approximate Fq = 0??L 1 i^. We do this by bounding each 

Fj between a stochastically smaller discrete CDF FJ and a stochastically 
larger one Fj~ , both defined below. Write F~ =^ Fj ^ F^~ for these stochastic 
inequalities. Then from 

m m 

(A.l) (g)Ff 4F Q 4<g)F+, 

j=i i=i 

we can derive upper and lower limits for Pr(Q > Q*) for any specific value 
of Q*. 

The support sets of F~ and Fj~ are 

S^jv = {0,7?,2j7, ...,(N- 1)77} and S+ N = S njN U {00}, 
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respectively, for r\ > 0. The upper limit has 

j{y) U y = oo. 

In the upper limit, any mass between the points of S^ N is pushed to the 
right. For the lower limit, we push mass to the left. If Fj has no atoms in 
Srj^N, then 



Fj(y + v), y/ve{0,l,2,...,N-2}, 
y=(N-l) V , 



and otherwise we use lim. z ^ y+rj ^ Fj(z) for the first N — 1 levels. We do not 
need to put mass at — oo in FJ because Fj has support on [0, oo). 

It should be more accurate to represent each Fj at values (i + 1/2)77 f° r 
< i < N and convolve those approximations [see Monahan (2001)]. But 
that approach does not give hard upper and lower bounds for Fq. 

Suppose that F and G both have support S^ N with probability mass 
functions / and g respectively. Then their convolution has support S^ 2 n~i- 
The mass at 00 in the convolution is (/® g)(oo) = /(oo) +g(oo) — /(oo)g(oo). 
The mass at multiples through 2 A — 2 times 77, is the ordinary convolution 
of mass functions / and g, 

k 

(f ® g){krf) = f(iv)g(( k - i)v)- 

The CDF F (8) G can then be obtained from the mass function f ® g. Thus 
the convolutions in (A.l) can all be computed by FFT with some additional 
bookkeeping to account for the atom at +00. 

When F and G have probability stored at N consecutive integer multiples 
of 77 > 0, then their convolution requires 2N — 1 such values. As a result, the 
bounds in (A.l) require almost mN storage. If we have storage for only N 
finite atoms the convolution could overflow it. We can save storage by trun- 
cating the CDF to support SZ~ N taking care to round up when convolving 
factors of the upper bound and to round down when convolving factors of 
the lower bound. 

For a CDF F with support S^ M where M > N, define \F~\ n with support 
S+ N by 

\F] N (ir))=F(ir)), < i < N, and [^^(00) = 1. 

That is, when rounding F up to [i 7 ]^, all the atoms of probability on A^r; 
through (M — 1)t? inclusive are added to the atom at +00. 
To round this F down to S^n, we may take 

[F\ N (irj)=F(irj), < i < A - 1 and [F\ N ((N - l)r]) = 1. 
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When rounding F down to L-^JN; an the atoms on Nr] through (M — l)i] 
and +00 are added to the atom at F({N — \)rj). This form of rounding 
never leaves an atom at +00 in the stochastic lower bound for a CDF. It 
is appropriate if the CDF being bounded is known to be proper. If the 
CDF to be bounded might possibly be improper with an atom at +00, then 
we could instead move only the atoms of F on Nr/ through (M — 1)77 to 
[F\n((N — 1)77), leave some mass at 00, and get a more accurate lower 
bound. 

The upper and lower bounds for Fq are now Fq and Fq~ , where 
F Q + =\ F Q~ 1)+ ® F ^N, J = h---,m, 

and Fq* 1 is the CDF of a point mass at 0. 

If all of the Fj are the same then one may speed things up further by 
computing Ff + via r — 1 FFTs in a repeated squaring sequence, and simi- 
larly for Ff ~ . For large m, only 0(log(7?7,)) FFTs need be done to compute 

Remark 2. If one of the Fj has some support in (— 00, 0) then changes 
are required. If Fj has support [— Aj, 00) for some Aj < 00 then we can work 
with the random variable Yj + Aj which has support [0, 00). The convolution 
of Fj and Fk then has support starting at — {Aj + Aj.). If Fj does not have 
a hard lower limit like Aj then we may adjoin an atom at —00 to the CDF 
representing its stochastic lower bound. As long as we never convolve a 
distribution with an atom at +00 with another distribution having an atom 
at —00, the results are well defined CDFs of extended real- valued random 
variables. 

A. 2. Alternative hypotheses. In this section, we get expressions for the 
CDFs Fj that need to be convolved. We suppose that (5j are independent 
M(Pj , 1) random variables for j = 1, . . . , m. The null hypothesis is Hq : f) = 
for (3 =(£1,..., (3 m ). 

The left, right and undirected test statistics take the form Y^JLi ^j; where 
Yj = t((3j) for a function t mapping R to [0, 00). Large values of Yj represent 
stronger evidence against Hq. The concordant test statistics are based on 
the larger of the left- and right-sided sums. 

The likelihood ratio tests A L , A R and A^ are sums of 

Y Lj = max(-/3j, 0) 2 , Y Rj = max(/3j, 0) 2 and Y Vj = /3 2 , 
respectively. After elementary manipulations, we find Fj for these tests via 
Pr(Y Lj <y) = H^y + (3j), Pv(Y Rj <y) = <D(^ - ft) 
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and 

Pr(Y Vj <y) = *{y/y - /?,) - ${-y/y- ft). 
The Fisher test statistics Q L , Q R and Q u are sums of 

Y Lj = -21og($(4)), Y Rj = -21og($(-4)) 

and 

y^ = -2io g (2*(-|^|)), 

respectively. The corresponding are given by 

Pr(Y Lj <y) - ^(e^ 2 )), 

Pr(Y Rj < y) = - ^(e"^ 2 )) 

and 

PriYuj <y) = - <5>-\\e-y/ 2 )) - $(J3j + ^(ie"^ 2 )). 

For three of the Stouffer statistics, no FFT is required because S R ~ 
Mirn- 1 / 2 x J2T=iPj, 1), S L = -S R and S c = \S R \. The remaining Stouffer 
statistic S u is the sum of Yjjj = \(3j\/^/m, with 

Pr(Y^ < y) = $(V^y - Pj) - v^y - 

A. 3. Convolutions for power calculations. The computations for this pa- 
per were done via convolution using N = 100,000 and r] = 0.001. Some ad- 
vantage might be gained by tuning N and rj to each case, but this was not 
necessary. The convolution approach allows hard upper and lower bounds 
for probabilities of the form Pr(Q < Q*) for given distributions Fj. For prac- 
tical values of N, the width of these bounds is dominated by discretization 
error in approximating F at N points. Empirically, it decreases like 0(1/N), 
for a given m, as we would expect because apart from some mass going to 
+oo, any mass being swept left or right moves at most 0(r]/N). For very 
large values of N, the numerical error in approximating Q^ 1 would become 
a factor. 

Each convolution was coupled with a Monte Carlo computation of N sam- 
ple realizations. Partly this was done to provide a check on the convolutions, 
but in some instances, the Monte Carlo answers were more accurate. 

The possibility for Monte Carlo to be sharper than convolution arises for 
test statistics like Q c = m.ax(Q R ,Q L ). Suppose that Q L and Q R are neg- 
atively associated and that we have bounds Fq~ 4 F^ 4 Fq + and Fq~ 4 
Fq 4 F R+ . Even if F^~ = F^ + and F R ~ = F R+ , we still do not get arbi- 
trarily narrow bounds for Fq. In particular, increasing N will not suffice to 
get an indefinitely small interval. 
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When Q L and Q R are negatively associated, we can then derive from 
Theorem 2 that Fq~ 4 =4 F^ + where for i = 0, . . . , N - 1, 

F$+(ir l )=ma X (0,F§ + (ir,)+F%+(ir,) - 1) 
andi^+(oo)=l. 

Surprisingly, this is often enough to get a very sharp bound on Q . But 
in some other cases, the Monte Carlo bounds are sharper. 

Monte Carlo confidence intervals for Pt(Q c > Q*) were computed by a 
formula for binomial confidence intervals in Agresti (2002), page 15. This 
formula is the usual Wald interval after adding pseudo-counts to the number 
of successes and failures. For a 100(1 — a)% confidence interval one uses 

7f ± - a/2)0r(l-7r)/JV, 

where n = (Ntt + (1 - a/2) 2 /2) /(N + <I>" 1 (1 - a/2) 2 ) , and vr is simply the 
fraction of times that Q c > Q* was observed in N trials. For a = 0.001, this 
amounts to adding $~ 1 (0.9995) 2 = 10.8 pseudo-counts split equally between 
successes and failures, to the N observed counts. 
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